Implementation and Evaluation of NAS Parallel CG Benchmark on GPU Cluster with Proprietary Interconnect TCA

Authors

Kazuya Matsumoto

Norihisa Fujita

Toshihiro Hanawa

Taisuke Boku

Abstract

We have been developing a proprietary interconnect technology called the Tightly Coupled Accelerators (TCA) architecture to improve communication latency and bandwidth between accelerators (GPUs) on different nodes. This paper presents a Conjugate Gradient (CG) benchmark implementation that uses TCA and reports its performance on the HA-PACS/TCA system, a proof-of-concept GPU cluster based on the TCA concept. The implementation is based on the CG benchmark in the NAS Parallel Benchmarks and is parallelized by a two-dimensional decomposition of the matrix data. For small benchmark classes, using TCA improves communication performance compared with an implementation that uses MPI over InfiniBand. This study also shows that the CG implementation with the two-dimensional decomposition is better suited to TCA than a CG implementation with a one-dimensional decomposition.
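The two-dimensional decomposition mentioned in the abstract can be illustrated with a small single-process sketch. This is not the authors' implementation: it uses no GPU, MPI, or TCA calls, and the names `block_ranges`, `matvec_2d`, and `cg`, the dense list-of-lists matrix, and the 2 x 2 process grid are all assumptions made for illustration. Each simulated "process" (i, j) owns one block of the matrix, computes a partial matrix-vector product, and the partial results are summed along each process row, which stands in for the reduction that a real cluster would perform over the interconnect.

```python
# Illustrative sketch only: simulates a 2D block decomposition in one process.
# Not the authors' code; no GPU, MPI, or TCA calls are involved.

def block_ranges(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) chunks."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def matvec_2d(A, x, prow, pcol):
    """Compute y = A x block by block on a hypothetical prow x pcol grid."""
    n = len(A)
    rows, cols = block_ranges(n, prow), block_ranges(n, pcol)
    y = [0.0] * n
    for (r0, r1) in rows:
        for (c0, c1) in cols:
            # Simulated process owns block A[r0:r1][c0:c1] and slice x[c0:c1].
            partial = [sum(A[i][k] * x[k] for k in range(c0, c1))
                       for i in range(r0, r1)]
            # Accumulating partials across a process row stands in for
            # the row-wise reduction between nodes.
            for off, v in enumerate(partial):
                y[r0 + off] += v
    return y

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg(A, b, prow=2, pcol=2, tol=1e-10, max_iter=100):
    """Textbook conjugate gradient for a symmetric positive definite A."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x with x = 0
    p = list(r)          # search direction
    rho = dot(r, r)
    for _ in range(max_iter):
        if rho ** 0.5 < tol:
            break
        q = matvec_2d(A, p, prow, pcol)
        alpha = rho / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        rho_new = dot(r, r)
        p = [ri + (rho_new / rho) * pi for ri, pi in zip(r, p)]
        rho = rho_new
    return x

# Small symmetric positive definite example system (made up for illustration).
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = cg(A, b)
```

In this sketch the matrix-vector product is the only step that depends on the decomposition; the dot products and vector updates are written globally for brevity, whereas a distributed version would also need reductions for them.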

Similar resources

Implementation of NAS Parallel Benchmarks in High Performance Fortran

We present an HPF implementation of BT, SP, LU, FT, CG and MG of the NPB2.3-serial benchmark set. The implementation is based on HPF performance model of the benchmark specific primitive operations with distributed arrays. We present profiling and performance data on SGI Origin 2000 and compare the results with NPB2.3. We discuss advantages and limitations of HPF and pghpf com-

Implementation of the direction of arrival estimation algorithms by means of GPU-parallel processing in the Kuda environment (Research Article)

Direction-of-arrival (DOA) estimation of audio signals is critical in different areas, including electronic war, sonar, etc. The beamforming methods like Minimum Variance Distortionless Response (MVDR), Delay-and-Sum (DAS), and subspace-based Multiple Signal Classification (MUSIC) are the most known DOA estimation techniques. The mentioned methods have high computational complexity. Hence using...

Performance evaluation of CP-PACS on CG benchmark

In this research, we evaluate NAS Parallel Benchmarks ver.1 Kernel CG on massively parallel processor CP-PACS, and analyze the result. CP-PACS' CPU has a special register which is auto-incremented by clock cycle, and we can instrument time spent for any function routine with very high accuracy. As a result of performance analysis, especially for data transfer time, our desk-top estimation fits to...

Parallel Implementation of Particle Swarm Optimization Variants Using Graphics Processing Unit Platform

There are different variants of Particle Swarm Optimization (PSO) algorithm such as Adaptive Particle Swarm Optimization (APSO) and Particle Swarm Optimization with an Aging Leader and Challengers (ALC-PSO). These algorithms improve the performance of PSO in terms of finding the best solution and accelerating the convergence speed. However, these algorithms are computationally intensive. The go...

Analysis of 2D Torus and Hub Topologies of 100Mb/s Ethernet for the Whitney Commodity Computing Testbed

A variety of different network technologies and topologies are currently being evaluated as part of the Whitney Project. This paper reports on the implementation and performance of a Fast Ethernet network configured in a 4x4 2D torus topology in a testbed cluster of “commodity” Pentium Pro PCs. Several benchmarks were used for performance evaluation: an MPI point to point message passing benchm...

Publication date: 2016
